
Worldwide Web Crawls

Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites.

Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites.

Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for each Crawl.

Worldwide Web Crawls are run using the Heritrix software.

In addition, various rules are applied to the logic of each crawl. Those rules define things like the depth the crawler will try to reach for each host (website) it finds. In general terms, the crawling software identifies all the URLs on each page it captures, follows those links, attempts to capture those pages, identifies new URLs, follows those links, and so on, until the crawl is stopped or pre-set conditions such as site depth limits are reached. For the most part, a given host is captured only once per Worldwide Web Crawl; however, it might be captured more frequently (e.g. once per hour for various news sites) via other crawls.
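
The capture-and-follow loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Heritrix's implementation: the function name `crawl`, the `get_links` callback, the `max_depth` parameter and the toy link graph are all assumptions made for the example, and depth is counted as hops from a seed rather than per host.

```python
from collections import deque

def crawl(seeds, get_links, max_depth):
    """Breadth-first crawl: capture each URL once, stop following links at max_depth hops."""
    seeds = list(dict.fromkeys(seeds))      # drop duplicate seeds, keep their order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    captured = []
    while queue:
        url, depth = queue.popleft()
        captured.append(url)                # stand-in for fetching and storing the page
        if depth == max_depth:
            continue                        # depth limit reached: do not follow links
        for link in get_links(url):         # URLs identified on the captured page
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return captured

# Toy link graph standing in for the live web (hypothetical URLs).
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/3"],
    "http://b.example/": ["http://a.example/"],
}

def toy_links(url):
    return TOY_WEB.get(url, [])

if __name__ == "__main__":
    for page in crawl(["http://a.example/"], toy_links, max_depth=2):
        print(page)
```

The `seen` set is what keeps a page from being captured twice within one crawl; a real crawler would also apply politeness delays and scope rules, which are omitted here.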

587,265 results

Part of: Internet Archive Web Crawls
Media Type: collections (19); web (586,671); data (575)
Year: 2019 (4,530); 2018 (45,060); 2017 (93,276); 2016 (94,069); 2015 (94,475); 2014 (56,545)
Topics & Subjects: crawldata (586,671); amazonbooks (252)
Creator: internet archive (586,339); lekash@archive.org (3)
Language: English (3)

Collection: 818M views
The seed for Wide00014 was:
- Slash pages from every domain on the web:
-- a list of domains using Survey crawl seeds
-- a list of domains using Wide00012 web graph
-- a list of domains using Wide00013 web graph
- Top ranked pages (up to a max of 100) from every linked-to domain using the Wide00012 inter-domain navigational link graph:
-- a ranking of all URLs that have more than one incoming inter-domain link (rank was determined by number of incoming links using Wide00012 inter domain links)...
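
The "top ranked pages" rule in the Wide00014 seed description can be illustrated with a short Python sketch. It is only one possible reading of that description: the function names, the `(source_url, target_url)` edge format and the use of the URL hostname as a stand-in for "domain" are assumptions for the example, not details given by the source.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def host_of(url):
    """Hostname of a URL, used here as a simple stand-in for its domain."""
    return urlsplit(url).hostname or ""

def top_pages_per_domain(links, max_per_domain=100, min_incoming=2):
    """Rank URLs by incoming inter-domain links and keep the top ones per domain.

    links: iterable of (source_url, target_url) pairs from a link graph.
    """
    incoming = Counter()
    for src, dst in set(links):                 # count each distinct link once
        if host_of(src) != host_of(dst):        # inter-domain links only
            incoming[dst] += 1

    by_domain = defaultdict(list)
    for url, count in incoming.items():
        if count >= min_incoming:               # "more than one incoming" link
            by_domain[host_of(url)].append((count, url))

    return {
        domain: [url for _, url in
                 sorted(ranked, key=lambda pair: (-pair[0], pair[1]))[:max_per_domain]]
        for domain, ranked in by_domain.items()
    }
```

Ties are broken alphabetically by URL so that the output is deterministic; the source does not say how ties were handled.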
Wide Crawl started April 2013
Collection: 25,035 items, 704.4M views
Web wide crawl with initial seedlist and crawler configuration from April 2013.
Wide Crawl started June 2014
Collection: 45,341 items, 579.7M views
Web wide crawl with initial seedlist and crawler configuration from June 2014.
Wide Crawl Number 12 - started March, 14th 2015
Collection: 49,621 items, 569.7M views
Web wide crawl with initial seedlist and crawler configuration from January 2015.
Wide Crawl started August 2013
Collection: 21,932 items, 454.5M views
Web wide crawl with initial seedlist and crawler configuration from August 2013.
Collection: 438.8M views
Web wide crawl.
Wide Crawl started January 2012
Collection: 30,373 items, 403.6M views
Web wide crawl with initial seedlist and crawler configuration from January 2012 using HQ software.
Wide Crawl Number 13
Collection: 46,050 items, 398.7M views
Web Wide Crawl Number 13
Wide Crawl started April 2012
Collection: 39,279 items, 358.4M views
Web wide crawl with initial seedlist and crawler configuration from April 2012.
Collection: 300M views
Web wide crawl number 16. The seed list for Wide00016 was made from the join of the top 1 million domains from CISCO and the top 1 million domains from Alexa.
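
As a rough illustration of joining two ranked domain lists into one seed list, the sketch below merges them and drops duplicates. The source does not say exactly how the join was done, so treating it as an order-preserving union, as well as the function name `join_seed_lists` and the normalisation applied (lowercasing, stripping a trailing dot), are assumptions.

```python
def join_seed_lists(*lists):
    """Merge several domain lists into one, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for domains in lists:
        for domain in domains:
            key = domain.strip().lower().rstrip(".")   # normalise before comparing
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged

if __name__ == "__main__":
    # Hypothetical inputs standing in for the two top-1-million lists.
    list_a = ["Example.com", "a.example"]
    list_b = ["a.example.", "b.example"]
    print(join_seed_lists(list_a, list_b))
```

Because the two source lists overlap heavily in practice, the merged list would be expected to hold fewer than 2 million entries.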
Wide Crawl started October 2010
Collection: 15,839 items, 268.7M views
Web wide crawl with initial seedlist and crawler configuration from October 2010.
Wide Crawl Started January 2013
Collection: 15,157 items, 263.7M views
Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
Wide Crawl started September 2012
Collection: 22,423 items, 258M views
Web wide crawl with initial seedlist and crawler configuration from September 2012.
Wide Crawl started February 2014
Collection: 9,806 items, 253.1M views
Web wide crawl with initial seedlist and crawler configuration from February 2014.
Wide Crawl started October 2011
Collection: 12,648 items, 237.4M views
Web wide crawl with initial seedlist and crawler configuration from March 2011 using HQ software.
Wide Crawl started March 2011
Collection: 8,528 items, 218.6M views
Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What’s in the data set:
- Crawl start date: 09 March, 2011
- Crawl end date: 23 December, 2011
- Number of captures: 2,713,676,341
- Number of unique URLs: 2,273,840,159
- Number of hosts: 29,032,069
The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT)...
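
A few derived figures make the March 2011 data set statistics easier to compare with other crawls. The short calculation below uses only the numbers quoted above; the variable names are illustrative, and the ratios are plain averages that say nothing about how captures were distributed across hosts.

```python
from datetime import date

# Figures quoted in the March 2011 crawl description.
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069
crawl_days = (date(2011, 12, 23) - date(2011, 3, 9)).days

captures_per_url = captures / unique_urls       # average captures per unique URL
urls_per_host = unique_urls / hosts             # average unique URLs per host
repeat_captures = captures - unique_urls        # captures beyond the first per URL
captures_per_day = captures / crawl_days        # average daily capture rate

if __name__ == "__main__":
    print(f"crawl length:     {crawl_days} days")
    print(f"captures per URL: {captures_per_url:.2f}")
    print(f"URLs per host:    {urls_per_host:.1f}")
    print(f"repeat captures:  {repeat_captures:,}")
    print(f"captures per day: {captures_per_day:,.0f}")
```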
Collection: 128.4M views
Wide17 was seeded with the "Total Domains" list of 256,796,456 URLs provided by Domains Index on June 26th, and crawled with max-hops set to "3" and de-duplication set "on".
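
To give a feel for what a de-duplication setting does, the sketch below records a payload in full the first time it is seen and as a lightweight "revisit" afterwards. It is a simplified illustration and not the crawler's actual logic: the function name `record_captures`, the record tuples and the choice to key on the SHA-1 payload digest alone are assumptions for the example.

```python
import hashlib

def record_captures(captures):
    """Label each (url, body) capture as a full 'response' or a duplicate 'revisit'."""
    seen_digests = {}
    records = []
    for url, body in captures:
        digest = hashlib.sha1(body).hexdigest()      # fingerprint of the payload bytes
        if digest in seen_digests:
            # Same bytes were already stored: note the revisit instead of storing again.
            records.append(("revisit", url, digest))
        else:
            seen_digests[digest] = url
            records.append(("response", url, digest))
    return records

if __name__ == "__main__":
    sample = [
        ("http://a.example/", b"<html>same</html>"),
        ("http://b.example/", b"<html>same</html>"),
        ("http://c.example/", b"<html>other</html>"),
    ]
    for kind, url, digest in record_captures(sample):
        print(kind, url, digest[:8])
```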
Wide Crawl started September 2010
Collection: 332 items, 7.5M views
Web wide crawl with initial seedlist and crawler configuration from September 2010.
Wide Crawl started January 2012
Web item: 5.1M views
Internet Archive crawldata from Webwide Crawl, captured by crawl423.us.archive.org:wide from Tue Jan 17 08:02:53 PST 2012 to Tue Jan 17 01:16:20 PST 2012.
Topic: crawldata
Wide Crawl started February 2014
Web item: 3.8M views
Internet Archive crawldata from Webwide Crawl, captured by crawl453.us.archive.org:wide from Wed Feb 19 01:09:37 PST 2014 to Tue Feb 18 21:33:27 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 3.8M views
Internet Archive crawldata from Webwide Crawl, captured by crawl454.us.archive.org:wide from Wed Feb 19 05:20:19 PST 2014 to Wed Feb 19 01:54:33 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 3.8M views
Internet Archive crawldata from Webwide Crawl, captured by crawl420.us.archive.org:wide from Tue Feb 18 17:01:58 PST 2014 to Tue Feb 18 13:14:06 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 3.8M views
Internet Archive crawldata from Webwide Crawl, captured by crawl427.us.archive.org:wide from Wed Feb 19 09:49:01 PST 2014 to Wed Feb 19 06:07:15 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 3.6M views
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Wed Feb 19 07:58:38 PST 2014 to Wed Feb 19 05:13:46 PST 2014.
Topic: crawldata
Host Screen Captures
Collection: 17,131 items, 3.5M views
Screen captures of hosts discovered during wide crawls. This data is currently not publicly accessible.
Wide Crawl started February 2014
Web item: 3.1M views
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Wed Feb 19 08:18:23 PST 2014 to Wed Feb 19 04:21:37 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 3.1M views
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Tue Feb 18 22:58:46 PST 2014 to Tue Feb 18 19:25:19 PST 2014.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Mon Feb 12 21:42:38 PST 2018 to Mon Feb 12 15:20:34 PST 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Wed Jan 4 01:00:14 PST 2017 to Tue Jan 3 19:50:56 PST 2017.
Topic: crawldata
Wide Crawl started February 2014
Web item: 1.9M views, 1 favorite
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Sat Feb 8 03:46:42 PST 2014 to Fri Feb 7 23:17:16 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 1.8M views
Internet Archive crawldata from Webwide Crawl, captured by crawl414.us.archive.org:wide from Sat Feb 8 04:46:28 PST 2014 to Sat Feb 8 00:01:23 PST 2014.
Topic: crawldata
Wide Crawl started January 2012
Web item: 1.6M views
Internet Archive crawldata from Webwide Crawl, captured by crawl413.us.archive.org:wide from Sat Jan 21 04:01:50 PST 2012 to Fri Jan 20 21:01:34 PST 2012.
Topic: crawldata
Wide Crawl started April 2013
Web item: 1.4M views, 1 favorite
Internet Archive crawldata from Webwide Crawl, captured by crawl415.us.archive.org:wide from Wed May 15 12:25:51 PDT 2013 to Wed May 15 06:56:55 PDT 2013.
Topic: crawldata
Wide Crawl Number 12 - started March, 14th 2015
Web item: 1.4M views
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Sat Mar 14 23:38:56 PDT 2015 to Sat Mar 14 17:31:22 PDT 2015.
Topic: crawldata
Wide Crawl started June 2014
Web item: 1.3M views
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Thu Jul 10 06:43:41 PDT 2014 to Thu Jul 10 01:23:01 PDT 2014.
Topic: crawldata
Wide Crawl started June 2014
Web item: 1.3M views
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Thu Jul 10 07:24:15 PDT 2014 to Thu Jul 10 01:45:52 PDT 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 1.2M views
Internet Archive crawldata from Webwide Crawl, captured by crawl338.us.archive.org:wide from Sat Feb 22 06:42:19 PST 2014 to Sat Feb 22 01:03:35 PST 2014.
Topic: crawldata
Wide Crawl started June 2014
Web item: 1M views
Internet Archive crawldata from Webwide Crawl, captured by crawl424.us.archive.org:wide from Tue Jul 1 13:59:42 PDT 2014 to Tue Jul 1 08:25:02 PDT 2014.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl813.us.archive.org:wide from Tue Jun 6 23:57:11 PDT 2017 to Tue Jun 6 17:57:05 PDT 2017.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Wed Jun 7 00:20:30 PDT 2017 to Tue Jun 6 18:18:45 PDT 2017.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Fri Mar 25 18:07:39 PDT 2016 to Fri Mar 25 13:34:00 PDT 2016.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl808.us.archive.org:wide from Wed Jun 7 01:07:21 PDT 2017 to Tue Jun 6 19:06:13 PDT 2017.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl808.us.archive.org:wide from Tue Jun 6 23:48:51 PDT 2017 to Tue Jun 6 17:43:39 PDT 2017.
Topic: crawldata
Wide Crawl started February 2014
Web item: 994,899 views
Internet Archive crawldata from Webwide Crawl, captured by crawl417.us.archive.org:wide from Sat Feb 22 09:02:42 PST 2014 to Sat Feb 22 06:09:28 PST 2014.
Topic: crawldata
Wide Crawl started April 2013
Web item: 932,557 views
Internet Archive crawldata from Webwide Crawl, captured by crawl424.us.archive.org:wide from Wed Jul 24 23:32:46 PDT 2013 to Wed Jul 24 18:16:50 PDT 2013.
Topic: crawldata
Wide Crawl Number 12 - started March, 14th 2015
Web item: 908,232 views
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Sat Mar 14 20:34:37 PDT 2015 to Sat Mar 14 14:59:03 PDT 2015.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl424.us.archive.org:wide from Tue Aug 16 09:06:32 PDT 2016 to Tue Aug 16 03:02:57 PDT 2016.
Topic: crawldata
Wide Crawl Number 12 - started March, 14th 2015
Web item: 828,541 views
Internet Archive crawldata from Webwide Crawl, captured by crawl803.us.archive.org:wide from Mon Mar 23 07:02:14 PDT 2015 to Mon Mar 23 01:21:52 PDT 2015.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl424.us.archive.org:wide from Tue May 31 01:31:41 PDT 2016 to Mon May 30 19:52:59 PDT 2016.
Topic: crawldata
Wide Crawl started October 2010
Web item: 797,696 views
Internet Archive crawldata from all sites, captured by ia360919.us.archive.org:wide from Fri Sep 24 20:27:19 UTC 2010 to Sat Sep 25 04:26:09 UTC 2010.
Topic: crawldata
Wide Crawl Number 16: Started June 3rd, 2017 - Still running
Web item: 754,818 views
Internet Archive crawldata from Webwide Crawl, captured by crawl809.us.archive.org:wide from Tue Jun 6 00:51:50 PDT 2017 to Mon Jun 5 19:34:08 PDT 2017.
Topic: crawldata
Wide Crawl Number 16: Started June 3rd, 2017 - Still running
Web item: 752,251 views
Internet Archive crawldata from Webwide Crawl, captured by crawl801.us.archive.org:wide from Tue Jun 6 07:52:14 PDT 2017 to Tue Jun 6 03:02:49 PDT 2017.
Topic: crawldata
Wide Crawl Number 16: Started June 3rd, 2017 - Still running
Web item: 750,555 views
Internet Archive crawldata from Webwide Crawl, captured by crawl423.us.archive.org:wide from Tue Jun 6 06:46:26 PDT 2017 to Tue Jun 6 01:11:01 PDT 2017.
Topic: crawldata
Wide Crawl started April 2013
Web item: 643,477 views
Internet Archive crawldata from Webwide Crawl, captured by crawl417.us.archive.org:wide from Wed May 15 12:43:53 PDT 2013 to Wed May 15 07:09:54 PDT 2013.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Fri Mar 25 18:12:44 PDT 2016 to Fri Mar 25 13:23:44 PDT 2016.
Topic: crawldata
Wide Crawl started April 2013
Web item: 636,647 views
Internet Archive crawldata from Webwide Crawl, captured by crawl415.us.archive.org:wide from Wed May 15 13:19:48 PDT 2013 to Wed May 15 07:48:12 PDT 2013.
Topic: crawldata
Wide Crawl started February 2014
Web item: 631,402 views
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Mon Feb 10 13:27:36 PST 2014 to Mon Feb 10 08:27:12 PST 2014.
Topic: crawldata
Wide Crawl Number 16: Started June 3rd, 2017 - Still running
Web item: 630,243 views
Internet Archive crawldata from Webwide Crawl, captured by crawl814.us.archive.org:wide from Wed Jun 7 01:39:13 PDT 2017 to Tue Jun 6 19:25:06 PDT 2017.
Topic: crawldata
Wide Crawl started April 2013
Web item: 606,093 views
Internet Archive crawldata from Webwide Crawl, captured by crawl418.us.archive.org:wide from Wed Jun 26 16:22:29 PDT 2013 to Wed Jun 26 11:29:54 PDT 2013.
Topic: crawldata
Wide Crawl started February 2014
Web item: 537,509 views
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Sat Feb 8 21:10:24 PST 2014 to Sat Feb 8 16:09:44 PST 2014.
Topic: crawldata
Wide Crawl started September 2012
Web item: 533,603 views
Internet Archive crawldata from Webwide Crawl, captured by crawl337.us.archive.org:wide from Wed Oct 17 08:14:47 PDT 2012 to Wed Oct 17 02:41:59 PDT 2012.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl809.us.archive.org:wide from Mon Aug 1 22:32:39 PDT 2016 to Mon Aug 1 17:39:28 PDT 2016.
Topic: crawldata
Wide Crawl Number 16: Started June 3rd, 2017 - Still running
Web item: 518,492 views
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Sat Jun 3 20:55:35 PDT 2017 to Sat Jun 3 14:24:26 PDT 2017.
Topic: crawldata
Wide Crawl started April 2013
Web item: 515,929 views
Internet Archive crawldata from Webwide Crawl, captured by crawl413.us.archive.org:wide from Sun May 12 11:51:10 PDT 2013 to Sun May 12 06:15:36 PDT 2013.
Topic: crawldata
Wide Crawl started February 2014
Web item: 514,823 views
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Sat Feb 8 14:04:44 PST 2014 to Sat Feb 8 09:51:54 PST 2014.
Topic: crawldata
Wide Crawl started October 2010
Web item: 491,888 views
Internet Archive crawldata from all sites, captured by ia360905.us.archive.org:wide from Sat Dec 18 18:22:04 UTC 2010 to Sat Dec 18 23:01:04 UTC 2010.
Topic: crawldata
Wide Crawl started April 2013
Web item: 490,801 views
Internet Archive crawldata from Webwide Crawl, captured by crawl419.us.archive.org:wide from Wed May 15 09:24:17 PDT 2013 to Wed May 15 04:10:53 PDT 2013.
Topic: crawldata
Wide Crawl started April 2013
Web item: 487,830 views
Internet Archive crawldata from Webwide Crawl, captured by crawl417.us.archive.org:wide from Wed May 15 09:49:58 PDT 2013 to Wed May 15 04:09:39 PDT 2013.
Topic: crawldata
Wide Crawl started January 2012
Web item: 478,133 views
Internet Archive crawldata from Webwide Crawl, captured by crawl420.us.archive.org:wide from Wed Jan 11 15:00:58 PST 2012 to Wed Jan 11 07:47:03 PST 2012.
Topic: crawldata
Wide Crawl started October 2010
Web item: 476,994 views
Internet Archive crawldata from all sites, captured by ia360905.us.archive.org:wide from Sat Dec 18 19:50:25 UTC 2010 to Sat Dec 18 23:32:37 UTC 2010.
Topic: crawldata
Wide Crawl started April 2013
Web item: 475,575 views
Internet Archive crawldata from Webwide Crawl, captured by crawl418.us.archive.org:wide from Sun May 12 01:57:26 PDT 2013 to Sat May 11 20:51:14 PDT 2013.
Topic: crawldata
Wide Crawl started April 2013
Web item: 474,312 views
Internet Archive crawldata from Webwide Crawl, captured by crawl418.us.archive.org:wide from Sat May 11 23:29:50 PDT 2013 to Sat May 11 19:10:10 PDT 2013.
Topic: crawldata
Wide Crawl started April 2013
Web item: 470,095 views
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Sat May 11 23:54:41 PDT 2013 to Sat May 11 18:21:46 PDT 2013.
Topic: crawldata
Wide Crawl started January 2012
Web item: 468,997 views
Internet Archive crawldata from Webwide Crawl, captured by crawl427.us.archive.org:wide from Tue Apr 17 00:58:46 PDT 2012 to Mon Apr 16 20:37:31 PDT 2012.
Topic: crawldata
Wide Crawl Started January 2013
Web item: 457,995 views
Internet Archive crawldata from Webwide Crawl, captured by crawl336.us.archive.org:wide from Fri Apr 12 05:44:44 PDT 2013 to Fri Apr 12 01:55:44 PDT 2013.
Topic: crawldata
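
The crawldata item descriptions listed above follow a fixed shape ("captured by HOST:JOB from TIMESTAMP to TIMESTAMP."), so they can be pulled apart with a regular expression. The sketch below is one hedged way to do that; the pattern, the function names and the returned field names are assumptions for the example rather than any official schema. Because the two timestamps do not always appear in chronological order in these descriptions, the sketch reports the absolute span between them, and only when both use the same time zone abbreviation.

```python
import re
from datetime import datetime

PATTERN = re.compile(
    r"captured by (?P<host>[\w.-]+):(?P<job>\w+) "
    r"from (?P<first>.+?) to (?P<second>.+?)\.$"
)

def parse_stamp(text):
    """Split e.g. 'Fri Apr 12 05:44:44 PDT 2013' into a naive datetime and its zone name."""
    _weekday, month, day, clock, zone, year = text.split()
    naive = datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")
    return naive, zone

def parse_description(description):
    """Extract crawler host, job name, both timestamps and the span between them."""
    match = PATTERN.search(description)
    if match is None:
        return None
    first, zone_a = parse_stamp(match["first"])
    second, zone_b = parse_stamp(match["second"])
    return {
        "host": match["host"],
        "job": match["job"],
        "first": first,
        "second": second,
        "zone": zone_a,
        # Only comparable without time zone data when both stamps share a zone.
        "span": abs(first - second) if zone_a == zone_b else None,
    }

if __name__ == "__main__":
    text = ("Internet Archive crawldata from Webwide Crawl, captured by "
            "crawl336.us.archive.org:wide from Fri Apr 12 05:44:44 PDT 2013 "
            "to Fri Apr 12 01:55:44 PDT 2013.")
    print(parse_description(text))
```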